Goto

Collaborating Authors

 Soweto


Long-form factuality in large language models Jerry Wei 1 Chengrun Y ang 1 Xinying Song 1 Yifeng Lu

Neural Information Processing Systems

To benchmark a model's long-form factuality in open domains, we first use GPT -4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be used as automated evaluators for long-form factuality through a method which we call Search-Augmented Factuality Evaluator (SAFE).



Towards culturally-appropriate conversational AI for health in the majority world: An exploratory study with citizens and professionals in Latin America

arXiv.org Artificial Intelligence

There is justifiable interest in leveraging conversational AI (CAI) for health across the majority world, but to be effective, CAI must respond appropriately within cultur ally and linguistically diverse context s . Therefore, we need ways to address the fact that current LLMs exclude many lived experience s globally . Various advances are underway which focus on top - down approaches and increas ing training data . In this paper, we aim to complement these with a bottom - up locally - grounded approach based on qualitative data collected during participatory workshops in Latin America. Our goal is to construct a rich and human - centred understanding o f: a) potential areas of cultural misalignment in digital health; b) regional perspectives on chatbots for health and c) strategies for creating culturally - appropriate CAI; with a focus on the understudied Latin American context . Our findings show that academic boundaries on notions of cultur e lose meaning at the ground level and technologies will need to engage with a broad er framework; one that encapsulates the way economics, politics, geogr aphy and local logistics are entangled in cultural experience. To this end, we introduce a framework for ' Pluriversal Conversational AI for H ealth ' which allows for the possibility that more relationality and tolerance, rather than just more data, may be called for .


CogSimulator: A Model for Simulating User Cognition & Behavior with Minimal Data for Tailored Cognitive Enhancement

arXiv.org Artificial Intelligence

The interplay between cognition and gaming, notably through educational games enhancing cognitive skills, has garnered significant attention in recent years. This research introduces the CogSimulator, a novel algorithm for simulating user cognition in small-group settings with minimal data, as the educational game Wordle exemplifies. The CogSimulator employs Wasserstein-1 distance and coordinates search optimization for hyperparameter tuning, enabling precise few-shot predictions in new game scenarios. Comparative experiments with the Wordle dataset illustrate that our model surpasses most conventional machine learning models in mean Wasserstein-1 distance, mean squared error, and mean accuracy, showcasing its efficacy in cognitive enhancement through tailored game design.


Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation

arXiv.org Artificial Intelligence

Large Language Models (LLMs) demonstrate promising capabilities in solving simple scientific problems but often produce hallucinations for complex ones. While integrating LLMs with tools can increase reliability, this approach typically results in over-reliance on tools, diminishing the model's ability to solve simple problems through basic reasoning. In contrast, human experts first assess problem complexity using domain knowledge before choosing an appropriate solution approach. Inspired by this human problem-solving process, we propose a novel two-component fine-tuning method. In the first component World Knowledge Distillation (WKD), LLMs learn directly from solutions generated using tool's information to internalize domain knowledge. In the second component Tool Usage Adaptation (TUA), we partition problems into easy and hard categories based on the model's direct answering accuracy. While maintaining the same alignment target for easy problems as in WKD, we train the model to intelligently switch to tool usage for more challenging problems. We validate our method on six scientific benchmark datasets, spanning mathematics, climate science and epidemiology. On average, our models demonstrate a 28.18% improvement in answer accuracy and a 13.89% increase in tool usage precision across all datasets, surpassing state-of-the-art models including GPT-4o and Claude-3.5.


Air pollution in South Africa: affordable new devices use AI to monitor hotspots in real time

AIHub

Air quality has become one of the most important public health issues in Africa. Poor air quality kills more people globally every year than HIV, TB and malaria combined. Air pollution makes people less productive because they get headaches and feel tired. India, for example, has poor air quality. The impact of India's poor air quality on its gross domestic product is about US 100 billion every year.


Refining Skewed Perceptions in Vision-Language Models through Visual Representations

arXiv.org Artificial Intelligence

Large vision-language models (VLMs), such as CLIP, have become foundational, demonstrating remarkable success across a variety of downstream tasks. Despite their advantages, these models, akin to other foundational systems, inherit biases from the disproportionate distribution of real-world data, leading to misconceptions about the actual environment. Prevalent datasets like ImageNet are often riddled with non-causal, spurious correlations that can diminish VLM performance in scenarios where these contextual elements are absent. This study presents an investigation into how a simple linear probe can effectively distill task-specific core features from CLIP's embedding for downstream applications. Our analysis reveals that the CLIP text representations are often tainted by spurious correlations, inherited in the biased pre-training dataset. Empirical evidence suggests that relying on visual representations from CLIP, as opposed to text embedding, is more practical to refine the skewed perceptions in VLMs, emphasizing the superior utility of visual representations in overcoming embedded biases. Our codes will be available in here.


Long-form factuality in large language models

arXiv.org Artificial Intelligence

Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model's long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be used as automated evaluators for long-form factuality through a method which we call Search-Augmented Factuality Evaluator (SAFE). SAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process comprising sending search queries to Google Search and determining whether a fact is supported by the search results. Furthermore, we propose extending F1 score as an aggregated metric for long-form factuality. To do so, we balance the percentage of supported facts in a response (precision) with the percentage of provided facts relative to a hyperparameter representing a user's preferred response length (recall). Empirically, we demonstrate that LLM agents can outperform crowdsourced human annotators - on a set of ~16k individual facts, SAFE agrees with crowdsourced human annotators 72% of the time, and on a random subset of 100 disagreement cases, SAFE wins 76% of the time. At the same time, SAFE is more than 20 times cheaper than human annotators. We also benchmark thirteen language models on LongFact across four model families (Gemini, GPT, Claude, and PaLM-2), finding that larger language models generally achieve better long-form factuality. LongFact, SAFE, and all experimental code are available at https://github.com/google-deepmind/long-form-factuality.


How Elon Musk Went from Superhero to Supervillain

The New Yorker

In 2021, Elon Musk became the world's richest man (no woman came close), and Time named him Person of the Year: "This is the man who aspires to save our planet and get us a new one to inhabit: clown, genius, edgelord, visionary, industrialist, showman, cad; a madcap hybrid of Thomas Edison, P. T. Barnum, Andrew Carnegie and Watchmen's Doctor Manhattan, the brooding, blue-skinned man-god who invents electric cars and moves to Mars." Right about when Time was preparing that giddy announcement, three women whose ovaries and uteruses were involved in passing down the madcap man-god's genes were in the maternity ward of a hospital in Austin. Musk believes a declining birth rate is a threat to civilization and, with his trademark tirelessness, is doing his visionary edgelord best to ward off that threat. Shivon Zilis, a thirty-five-year-old venture capitalist and executive at Musk's company Neuralink, was pregnant with twins, conceived with Musk by in-vitro fertilization, and was experiencing complications. "He really wants smart people to have kids, so he encouraged me to," Zilis said.


Data Augmentation for Neural NLP

arXiv.org Artificial Intelligence

Data scarcity is a problem that occurs in languages and tasks where we do not have large amounts of labeled data but want to use state-of-the-art models. Such models are often deep learning models that require a significant amount of data to train. Acquiring data for various machine learning problems is accompanied by high labeling costs. Data augmentation is a low-cost approach for tackling data scarcity. This paper gives an overview of current state-of-the-art data augmentation methods used for natural language processing, with an emphasis on methods for neural and transformer-based models. Furthermore, it discusses the practical challenges of data augmentation, possible mitigations, and directions for future research.